42        Bioinformatics

The FASTX-toolkit tools listed in Table 1.4 are used for quality assessment and quality
adjustment. Their major limitation is that “fastq_quality_filter” does not process the two
paired-end FASTQ files together, so filtering usually leaves singletons (reads whose mate
has been removed) in one or both of the paired-end FASTQ files. Most aligners will not
process paired-end FASTQ files that contain singletons. The FASTX-toolkit solution to the
singleton problem is to mask the low-quality bases instead of removing the reads that
contain them. Thus, the “fastq_masker” program is used instead of “fastq_quality_filter”
to mask the bases whose Phred quality score is less than a user-defined threshold “-q”.

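# Mask bases with Phred quality below 20 (-Q33: qualities use the Phred+33 offset)
fastq_masker \
    -q 20 \
    -i bad.fastq \
    -o bad_masked.fastq \
    -Q33
# Re-run the quality report on the masked file and open it in a browser
fastqc bad_masked.fastq
firefox bad_masked_fastqc.html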

The above “fastq_masker” command masks the bases with a Phred quality score lower than 20
(“-q 20”), replacing each of them with “N” by default; therefore, aligners and assemblers
will treat them as unknown bases.
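The masking rule described above can be sketched in a few lines of shell and awk. This is only an illustration of the idea, not the FASTX-toolkit implementation: the function name mask_low_quality is hypothetical, and the sketch assumes four-line FASTQ records, Phred+33 quality encoding, and “N” as the masking character.

```shell
# Hypothetical helper (not part of FASTX-toolkit): replace every base whose
# Phred+33 quality is below a threshold with "N". Reads FASTQ from stdin.
mask_low_quality() {
    # $1 = minimum Phred score a base needs to be kept unmasked
    awk -v min_q="$1" '
        BEGIN {
            # Map each printable ASCII character to its Phred+33 score
            for (code = 33; code <= 126; code++)
                phred[sprintf("%c", code)] = code - 33
        }
        NR % 4 == 2 { seq = $0; next }     # sequence line: hold it
        NR % 4 == 3 { plus = $0; next }    # separator line: hold it
        NR % 4 == 0 {                      # quality line: mask, then print
            masked = ""
            for (i = 1; i <= length(seq); i++) {
                base = substr(seq, i, 1)
                if (phred[substr($0, i, 1)] < min_q) base = "N"
                masked = masked base
            }
            print masked
            print plus
            print $0
            next
        }
        { print }                          # header line: pass through
    '
}
```

For example, `mask_low_quality 20 < bad.fastq > bad_masked.fastq` would write a masked copy of the file. Every read is kept and only individual bases change, so the two files of a pair stay synchronized and no singletons are created.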

For paired-end FASTQ files produced by an Illumina instrument, there is another FASTQ
processing program called Trimmomatic [15], which was designed for Illumina data and
processes both files of a pair together. It is a multithreaded, Java-based command-line
program and is more modern than FASTX-toolkit. It performs several operations, including
detecting and removing known adaptor fragments (adapter.clip), trimming low-quality
regions from the beginning of the reads (trim.leading), trimming low-quality regions from
the end of the reads (trim.trailing), and filtering out short reads (min.read.length), in
addition to other operations with different quality-filtering strategies for dropping
low-quality bases in the reads (max.info and sliding.window).

Trimmomatic can remove adaptors in two modes: simple and palindrome. In the simple mode,
a pairwise local alignment between the adaptor sequence and each read is used to scan the
read from the 5′ end to the 3′ end with a seed-and-extend approach. If the score of a
match exceeds a user-defined threshold, both the matched region and the region after the
alignment are removed. The entire read is removed if the alignment covers the whole read.
The simple mode may not be able to detect a short adaptor fragment. The palindrome mode
is used in that case because it can detect and remove short adaptor fragments. It applies
only to paired-end data. When the sequenced fragment is shorter than the read length, the
forward and reverse reads contain an equal number of valid bases, the valid part of each
read is the reverse complement of the other, and the valid bases are followed by
contaminating bases from the adaptors. The tool uses the two complementary reads to
identify the adaptor fragment or any other contaminating technical sequence by globally
aligning the forward and reverse reads. An alignment score greater than a user-defined
threshold indicates that the first parts of the two reads are reverse complements of one
another, and the remaining read fragments, which match the adaptor sequence, are removed.
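A paired-end Trimmomatic run covering the operations described above might look like the following sketch. The wrapper function trim_paired, the TRIMMOMATIC variable, the output file names, and the threshold values are assumptions chosen for illustration. On the command line the steps are written as ILLUMINACLIP (adaptor clipping), LEADING, TRAILING, SLIDINGWINDOW, and MINLEN. The three numbers after the adaptor file in ILLUMINACLIP are the allowed seed mismatches, the palindrome clip threshold, and the simple clip threshold.

```shell
# Assumption: a "trimmomatic" launcher is on PATH; otherwise set
# TRIMMOMATIC="java -jar /path/to/trimmomatic.jar" before calling.
trim_paired() {
    # $1, $2 = forward and reverse FASTQ files; $3 = output prefix
    # (all file names here are hypothetical)
    ${TRIMMOMATIC:-trimmomatic} PE -threads 4 -phred33 \
        "$1" "$2" \
        "${3}_R1_paired.fastq" "${3}_R1_unpaired.fastq" \
        "${3}_R2_paired.fastq" "${3}_R2_unpaired.fastq" \
        ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
}
```

Calling `trim_paired sample_R1.fastq sample_R2.fastq sample` would write four files. The two paired outputs contain only reads whose mates both survived, and reads that lost their mate go to the unpaired outputs, so the paired files can be passed to an aligner without singletons.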